A Tail Bound for Read-k Families of Functions

Abstract

We prove a Chernoff-like large deviation bound on the sum of non-independent random variables that have the following dependence structure. The variables $Y_1, \dots, Y_r$ are arbitrary Boolean functions of independent random variables $X_1, \dots, X_m$, modulo a restriction that every $X_i$ influences at most $k$ of the variables $Y_1, \dots, Y_r$.



1 Introduction 



Let $Y_1, \dots, Y_r$ be independent indicator random variables with $\Pr[Y_i = 1] = p$ and let $Y = Y_1 + \dots + Y_r$ denote their sum. The average number of variables which are set is $\mathbb{E}[Y] = pr$.

A basic question is how concentrated the sum is around its average. The answer is given by the Chernoff bound: for any $\varepsilon > 0$, the probability that the fraction of variables that occur exceeds the expectation by more than $\varepsilon$ is bounded by

$$\Pr[Y_1 + \dots + Y_r \ge (p+\varepsilon) r] \le e^{-D(p+\varepsilon\|p)\cdot r} \le e^{-2\varepsilon^2 r},$$

and similarly, the probability that it is below the expectation by more than $\varepsilon$ is bounded by

$$\Pr[Y_1 + \dots + Y_r \le (p-\varepsilon) r] \le e^{-D(p-\varepsilon\|p)\cdot r} \le e^{-2\varepsilon^2 r}.$$

Here, $D(q\|p)$ is the Kullback-Leibler divergence, defined as

$$D(q\|p) = q \log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p},$$

and the latter estimate (which is the one more commonly used in applications) follows from $D(q\|p) \ge 2(q-p)^2$.
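As a numerical illustration of the two estimates above, the following sketch compares the exact upper tail of a binomial sum with the divergence bound and with the weaker $e^{-2\varepsilon^2 r}$ bound. The parameter values ($r = 100$, $p = 0.3$, $\varepsilon = 0.1$) and the helper names are assumptions chosen for the example, not taken from the paper.

```python
import math

def kl(q, p):
    """Binary Kullback-Leibler divergence D(q||p) in nats (0*log 0 treated as 0)."""
    total = 0.0
    if q > 0:
        total += q * math.log(q / p)
    if q < 1:
        total += (1 - q) * math.log((1 - q) / (1 - p))
    return total

def binomial_upper_tail(n, p, threshold):
    """Exact Pr[Bin(n, p) >= threshold]; small slack guards against float rounding."""
    start = math.ceil(threshold - 1e-9)
    return sum(math.comb(n, s) * p**s * (1 - p)**(n - s) for s in range(start, n + 1))

# Illustrative parameters (assumed for this example).
r, p, eps = 100, 0.3, 0.1

exact = binomial_upper_tail(r, p, (p + eps) * r)
kl_bound = math.exp(-kl(p + eps, p) * r)       # e^{-D(p+eps||p) r}
simple_bound = math.exp(-2 * eps**2 * r)       # e^{-2 eps^2 r}

print(exact, kl_bound, simple_bound)
```

The printed values should be non-decreasing from left to right, reflecting the chain of inequalities in the text.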

Our focus in this paper is on deriving similar estimates when the variables $Y_1, \dots, Y_r$ are not independent, but the level of dependence between them is in some sense 'weak'. These types of questions are commonly studied in probability theory, as they allow one to apply Chernoff-like bounds to more general scenarios. For example, one case which is used in various applications is where the variables $Y_1, \dots, Y_r$ are assumed to be $k$-wise independent. In this case, Bellare and Rompel [BR94] obtained Chernoff-like tail bounds on the probability that the number of set variables deviates from its expectation. Another well-studied case is when $Y_1, \dots, Y_r$ form a martingale, in which case Azuma's inequality and its generalizations give bounds which are comparable to the Chernoff bound.

We consider in this paper another model of weak dependence. Assume that the variables $Y_1, \dots, Y_r$ can be factored as functions of independent random variables $X_1, \dots, X_m$. More concretely, each $Y_j$ is a function of a subset of the variables $X_1, \dots, X_m$. One extreme case is that these subsets are disjoint, in which case the variables $Y_1, \dots, Y_r$ are independent. The other extreme case is when these subsets all share a common element, in which case $Y_1, \dots, Y_r$ can be arbitrary. We say that $Y_1, \dots, Y_r$ are a read-$k$ family if there exists such a factorization where each $X_i$ influences at most $k$ of the variables $Y_1, \dots, Y_r$.

Definition 1. (Read-$k$ families) Let $X_1, \dots, X_m$ be independent random variables. For $j \in [r]$, let $P_j \subseteq [m]$ and let $f_j$ be a Boolean function of $(X_i)_{i \in P_j}$. Assume that $|\{j : i \in P_j\}| \le k$ for every $i \in [m]$. Then the random variables $Y_j = f_j((X_i)_{i \in P_j})$ are called a read-$k$ family.

There are several motivations for studying read-$k$ families of functions: for example, they arise naturally when studying subgraph counts in random graphs, or in generalizations of read-once models in computational complexity. We will not discuss applications in this paper, but instead focus on the basic properties of read-$k$ families. Our main result is that Chernoff-like tail bounds hold for read-$k$ families, where the bounds degrade as $k$ increases.

Theorem 1.1. Let $Y_1, \dots, Y_r$ be a family of read-$k$ indicator variables with $\Pr[Y_i = 1] = p_i$, and let $p$ be the average of $p_1, \dots, p_r$. Then for any $\varepsilon > 0$,

$$\Pr[Y_1 + \dots + Y_r \ge (p+\varepsilon) r] \le e^{-D(p+\varepsilon\|p)\cdot r/k}$$

and

$$\Pr[Y_1 + \dots + Y_r \le (p-\varepsilon) r] \le e^{-D(p-\varepsilon\|p)\cdot r/k}.$$

That is, we obtain the same bound as that of the standard Chernoff bound, except that the exponent is divided by $k$. This is clearly tight: let $Y_1 = \dots = Y_k = X_1$, $Y_{k+1} = \dots = Y_{2k} = X_2$, etc., where $X_1, X_2, \dots, X_{r/k}$ are independent with $\Pr[X_i = 1] = p$. Then for example

$$\Pr[Y_1 + \dots + Y_r \ge (p+\varepsilon) r] = \Pr[X_1 + \dots + X_{r/k} \ge (p+\varepsilon) r/k].$$
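The replicated-variable example can be evaluated exactly, since the sum of the $Y_j$ is $k$ times a binomial sum of $r/k$ independent variables. The sketch below does this for assumed values $r = 120$, $k = 4$, $p = 0.3$, $\varepsilon = 0.1$ (chosen only for illustration) and compares the exact tail with the read-$k$ bound and with the bound that would hold under full independence.

```python
import math

def kl(q, p):
    """Binary Kullback-Leibler divergence D(q||p) in nats."""
    total = 0.0
    if q > 0:
        total += q * math.log(q / p)
    if q < 1:
        total += (1 - q) * math.log((1 - q) / (1 - p))
    return total

def binomial_upper_tail(n, p, threshold):
    """Exact Pr[Bin(n, p) >= threshold]; small slack guards against float rounding."""
    start = math.ceil(threshold - 1e-9)
    return sum(math.comb(n, s) * p**s * (1 - p)**(n - s) for s in range(start, n + 1))

# Illustrative parameters (assumed): r variables built from r/k independent X's.
r, k, p, eps = 120, 4, 0.3, 0.1
blocks = r // k

# Y_1 + ... + Y_r = k * (X_1 + ... + X_{r/k}), so the tail is a binomial tail.
tail = binomial_upper_tail(blocks, p, (p + eps) * r / k)

read_k_bound = math.exp(-kl(p + eps, p) * r / k)   # bound of Theorem 1.1
independent_bound = math.exp(-kl(p + eps, p) * r)  # bound valid only for k = 1

print(tail, read_k_bound, independent_bound)
```

For these parameters the exact tail lies between the two bounds: the read-$k$ bound holds, while the independent-case bound is violated, which is why some loss in the exponent is unavoidable.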

We note that concentration bounds for $Y = \sum_i Y_i$ may also be obtained by observing that $Y$ is a $k$-Lipschitz function of the underlying independent random variables $X_1, \dots, X_m$ and then applying standard martingale-based arguments, such as Azuma's inequality (see, for example, [DP09]). However, these techniques are limited in the sense that they only yield concentration bounds when the deviation $\varepsilon \cdot r$ above is at least $\sqrt{m}$ or so, whereas our results yield concentration bounds even when the deviation is roughly the square root of the mean $p \cdot r$, which may be much smaller. It is also known (see [Von10, Section 4]) that such 'dimension-free' concentration bounds cannot be obtained for Lipschitz functions in general.

1.1 Proof overview 

Let $Y_1, \dots, Y_r$ be a read-$k$ family with $\Pr[Y_i = 1] = p_i$. Let us consider for a moment a more basic question: what is the maximal probability that $Y_1 = \dots = Y_r = 1$? The answer to this question is given by Shearer's lemma [CGFS86], which is a simple yet extremely powerful tool.

Theorem 1.2. Let $Y_1, \dots, Y_r$ be a family of read-$k$ indicator variables with $\Pr[Y_j = 1] = p$. Then

$$\Pr[Y_1 = \dots = Y_r = 1] \le p^{r/k}.$$



Note that the answer is the $k$-th root of the answer in the case where the variables are independent, similar to what we get for the tail bounds. The result is tight, as can be seen by taking $Y_1 = \dots = Y_k = X_1$, $Y_{k+1} = \dots = Y_{2k} = X_2$, etc., where $X_1, X_2, \dots, X_{r/k}$ are independent with $\Pr[X_i = 1] = p$. The proof of Shearer's lemma is based on the entropy method. We refer the interested reader to a survey of Radhakrishnan [Rad03] on applications of Shearer's lemma, and to continuous analogs by Finner [Fin92] and Friedgut [Fri04].
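Theorem 1.2 can be checked by brute force on a small read-2 family. The family below (three fair bits, each $Y_j$ the AND of a different pair) is an assumed example, not one from the paper; it happens to attain the bound $p^{r/k}$ exactly, whereas independent variables with the same marginals would give $p^r$.

```python
from itertools import product

# Hypothetical example: X_1, X_2, X_3 are independent fair bits and
# Y_1 = X_1 AND X_2, Y_2 = X_2 AND X_3, Y_3 = X_1 AND X_3.
# Every X_i is read by exactly two of the Y_j, so this is a read-2 family.
reads = [(0, 1), (1, 2), (0, 2)]
k = 2
r = len(reads)

inputs = list(product([0, 1], repeat=3))

def y(j, x):
    """Value of Y_j on the input x."""
    return int(all(x[i] for i in reads[j]))

# Each Y_j has the same marginal probability p = 1/4.
p = sum(y(0, x) for x in inputs) / len(inputs)

# Exact probability that all Y_j equal 1, by enumeration.
all_ones = sum(all(y(j, x) for j in range(r)) for x in inputs) / len(inputs)

shearer_bound = p ** (r / k)   # p^{r/k}
independent_value = p ** r     # what independence would give

print(p, all_ones, shearer_bound, independent_value)
```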

We derive Theorem 1.1 by constructing an information-theoretic proof of the Chernoff bound and applying Shearer's lemma to make the proof robust for the case of non-independent random variables forming a read-$k$ family. In our proof we use the "entropy method" that was introduced by Ledoux [Led96], and more recently has been used to prove a number of concentration inequalities (cf. [BLM03]). From a technical point of view, we construct analogs of Shearer's lemma for Kullback-Leibler divergence.

2 Preliminaries 

Let $X$ be a random variable and $A$ be a subset of its support. We write "$X \in A$" to denote the event "the value of $X$ belongs to $A$".

2.1 Entropy 

If $X$ is distributed according to $\mu$, then we will write both $H[X]$ and $H[\mu]$ to denote the entropy $\mathbb{E}_{x \sim \mu}[\log(1/\mu(x))]$. All logarithms in this paper are natural. If $X$ is supported in the set $A$ then $H[X] \le \log|A|$, where equality holds when $X$ is uniformly distributed over $A$. For two random variables $X, Y$ we denote their conditional entropy by $H[X|Y] = \mathbb{E}_{y \sim Y}[H[X|Y = y]]$. We note that always $H[X] \ge H[X|Y]$. If $X_1, \dots, X_m$ are random variables then $H[X_1, \dots, X_m]$ denotes the entropy of their joint distribution. The chain rule of entropy is

$$H[X_1, \dots, X_m] = H[X_1] + H[X_2|X_1] + \dots + H[X_m|X_1, \dots, X_{m-1}].$$

2.2 Relative entropy

Let $\mu'$ and $\mu$ be distributions defined over the same discrete domain $A$. The relative entropy (or Kullback-Leibler divergence) between $\mu'$ and $\mu$ is defined as

$$D(\mu'\|\mu) = \sum_{a \in A} \mu'(a) \log\frac{\mu'(a)}{\mu(a)}.$$

If random variables $X, X'$ are distributed according to $\mu, \mu'$, respectively, then we write $D(X'\|X) = D(\mu'\|\mu)$. Kullback-Leibler divergence also satisfies a chain rule. We will only need the following simple corollary of it.

Fact 2.1. Let $X, X'$ be random variables defined on a domain $A$. Let $\phi : A \to B$ be some function. Then

$$D(X'\|X) \ge D(\phi(X')\|\phi(X)).$$

We will usually use relative entropy in the case when the second operand is the uniform 
distribution over some set. The following observations will be useful. 

Fact 2.2. Let $\mu$ be the uniform distribution over some set $A$, and let $\mu'$ be any distribution over $A$. Then

$$D(\mu'\|\mu) = H[\mu] - H[\mu'].$$

Claim 2.3. Let $\mu$ be the uniform distribution over some set $A$, and let $A' \subseteq A$ satisfy $\mu(A') = p$. Let $\mu'$ be any distribution over $A$, satisfying $\mu'(A') = q$. Then

$$D(\mu'\|\mu) \ge D(q\|p).$$

Here we use the standard convention D (q\\p) to denote the relative entropy between 
Bernoulli distributions whose probabilities of positive outcomes are, respectively, q and p. 

Proof. Let $X$ be a random variable taking values in $A$ according to the distribution $\mu'$, and let $I$ be the indicator of the event "$X \in A'$". Let $H[q]$ denote the entropy of the Bernoulli distribution whose probability of positive outcome is $q$. Then

$$\begin{aligned}
H[\mu'] = H[I] + H[X|I] &\le H[q] + q\log|A'| + (1-q)\log(|A| - |A'|) \\
&= H[q] + q\log(p|A|) + (1-q)\log((1-p)|A|) \\
&= H[q] + \log|A| + q\log p + (1-q)\log(1-p) \\
&= H[\mu] - D(q\|p),
\end{aligned}$$

and the result follows by Fact 2.2. ■
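Claim 2.3 can be sanity-checked numerically. The sketch below draws random distributions $\mu'$ over a 12-element set, takes $A'$ to be its first three elements (so $p = 1/4$), and records the gap $D(\mu'\|\mu) - D(q\|p)$. The set sizes, the seed, and the helper names are assumptions made for the example.

```python
import math
import random

def kl_bernoulli(q, p):
    """Binary Kullback-Leibler divergence D(q||p) in nats."""
    total = 0.0
    if q > 0:
        total += q * math.log(q / p)
    if q < 1:
        total += (1 - q) * math.log((1 - q) / (1 - p))
    return total

def kl_to_uniform(weights):
    """D(mu'||mu) where mu is uniform over len(weights) points."""
    n = len(weights)
    return sum(w * math.log(w * n) for w in weights if w > 0)

random.seed(0)
n, size_a_prime = 12, 3          # A = {0,...,11}, A' = {0,1,2}
p = size_a_prime / n             # mu(A') = 1/4

gaps = []
for _ in range(1000):
    raw = [random.random() for _ in range(n)]
    total = sum(raw)
    mu_prime = [w / total for w in raw]      # a random distribution over A
    q = sum(mu_prime[:size_a_prime])         # mu'(A')
    gaps.append(kl_to_uniform(mu_prime) - kl_bernoulli(q, p))

# Claim 2.3 predicts that every gap is non-negative (up to rounding).
print(min(gaps))
```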

We will need the following facts on $D(q\|p)$. The first is the convexity of Kullback-Leibler divergence. The last two relate to monotonicity of Kullback-Leibler divergence.

Fact 2.4. Let $0 \le p_1, p_2, q_1, q_2, \lambda \le 1$. Then

$$D(\lambda p_1 + (1-\lambda) p_2 \,\|\, \lambda q_1 + (1-\lambda) q_2) \le \lambda \cdot D(p_1\|q_1) + (1-\lambda) \cdot D(p_2\|q_2).$$

Fact 2.5. Let $0 \le p \le q \le q' \le 1$. Then $D(q'\|p) \ge D(q\|p)$.

Fact 2.6. Let $0 \le q' \le q \le p \le 1$. Then $D(q'\|p) \ge D(q\|p)$.

2.3 Shearer's Lemma 

In order to derive bounds on mutual entropy we will use an elegant tool of amazing universality, Shearer's Lemma [CGFS86]. The following formulation of the lemma will be convenient for us.

Lemma 2.7. [Shearer's Lemma] Let $X_1, \dots, X_m$ be random variables and let $P_1, \dots, P_r \subseteq [m]$ be subsets such that for every $i \in [m]$ it holds that $|\{j : i \in P_j\}| \ge k$. Then

$$k \cdot H[(X_1, \dots, X_m)] \le \sum_{j=1}^{r} H[(X_i)_{i \in P_j}].$$

For completeness we include a short proof, based on the idea commonly attributed to 
Jaikumar Radhakrishnan. 

Proof. Let us denote $s_j = |P_j|$ and let $P_j = \{i_{j,1}, \dots, i_{j,s_j}\}$, where we order the elements $i_{j,1} < \dots < i_{j,s_j}$. We apply the chain rule for entropy:

$$H[(X_i)_{i \in P_j}] = \sum_{t=1}^{s_j} H[X_{i_{j,t}} | X_{i_{j,1}}, \dots, X_{i_{j,t-1}}] \ge \sum_{t=1}^{s_j} H[X_{i_{j,t}} | X_1, \dots, X_{i_{j,t}-1}],$$

where the inequality holds because conditioning does not increase entropy. Summing over all $1 \le j \le r$ we obtain that

$$\sum_{j=1}^{r} H[(X_i)_{i \in P_j}] \ge \sum_{i=1}^{m} |\{j : i \in P_j\}| \cdot H[X_i | X_1, \dots, X_{i-1}] \ge k \cdot \sum_{i=1}^{m} H[X_i | X_1, \dots, X_{i-1}],$$

and the result follows by the chain rule. ■
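Shearer's Lemma can be verified numerically on small examples. The sketch below uses random joint distributions over four bits and the six two-element subsets of the index set, so every index is covered exactly $k = 3$ times. The choice of subsets, the seed, and the helper names are assumptions for the illustration.

```python
import math
import random
from itertools import product

def entropy(dist):
    """Shannon entropy (in nats) of a distribution given as {outcome: probability}."""
    return -sum(w * math.log(w) for w in dist.values() if w > 0)

def marginal(joint, coords):
    """Marginal distribution of the coordinates listed in coords."""
    out = {}
    for x, w in joint.items():
        key = tuple(x[i] for i in coords)
        out[key] = out.get(key, 0.0) + w
    return out

random.seed(1)
m = 4
# All pairs of indices: each index appears in exactly 3 subsets, so k = 3.
subsets = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2), (1, 3)]
k = 3

smallest_gap = math.inf
for _ in range(200):
    raw = {x: random.random() for x in product([0, 1], repeat=m)}
    total = sum(raw.values())
    joint = {x: w / total for x, w in raw.items()}   # arbitrary (dependent) joint law
    lhs = k * entropy(joint)
    rhs = sum(entropy(marginal(joint, s)) for s in subsets)
    smallest_gap = min(smallest_gap, rhs - lhs)

# Lemma 2.7 predicts rhs - lhs >= 0 for every joint distribution.
print(smallest_gap)
```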
We need the following simple corollary of Shearer's Lemma: 

Corollary 2.8. Let $X_1, \dots, X_m$ be random variables and let $P_1, \dots, P_r \subseteq [m]$ be such that for every $i \in [m]$ it holds that $|\{j : i \in P_j\}| \le k$. For each $i \in [m]$, let $Y_i$ be an independent random variable, uniformly distributed over a set that includes the support of $X_i$. Then

$$k \cdot D\big((X_i)_{i=1}^{m} \,\|\, (Y_i)_{i=1}^{m}\big) \ge \sum_{j=1}^{r} D\big((X_i)_{i \in P_j} \,\|\, (Y_i)_{i \in P_j}\big).$$

Note the difference in the conditions put upon $|\{j : i \in P_j\}|$: in Shearer's Lemma it was "$\ge k$", and in the corollary it is "$\le k$".

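Proof. First observe that w.l.o.g. we can assume that $|\{j : i \in P_j\}| = k$ for every $i$ (adding elements to some $P_j$ can only increase the right-hand side of the stated inequality, by Fact 2.1). Under this assumption we can apply Lemma 2.7, concluding that

$$k \cdot H[(X_i)_{i=1}^{m}] \le \sum_{j=1}^{r} H[(X_i)_{i \in P_j}].$$

Applying Fact 2.2 leads to

$$\begin{aligned}
k \cdot D\big((X_i)_{i=1}^{m} \,\|\, (Y_i)_{i=1}^{m}\big) &= k H[(Y_i)_{i=1}^{m}] - k H[(X_i)_{i=1}^{m}] \\
&\ge k H[(Y_i)_{i=1}^{m}] - \sum_{j=1}^{r} H[(X_i)_{i \in P_j}] \\
&= \sum_{j=1}^{r} \Big( H[(Y_i)_{i \in P_j}] - H[(X_i)_{i \in P_j}] \Big),
\end{aligned}$$

where the last equality follows from mutual independence of the $Y_i$'s and the assumption that every $i$ belongs to exactly $k$ sets among $P_1, \dots, P_r$. By Fact 2.2 again, the $j$-th summand equals $D((X_i)_{i \in P_j} \| (Y_i)_{i \in P_j})$. The result follows. ■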
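The corollary can be checked in the same numerical style as Shearer's Lemma. In the sketch below the $X_i$ are four arbitrarily correlated bits, the $Y_i$ are independent uniform bits, and the subsets form a path, so every index is covered at most $k = 2$ times. The subsets, the seed, and the helper names are assumptions for the illustration; divergences to the uniform distribution are computed through Fact 2.2.

```python
import math
import random
from itertools import product

def marginal(joint, coords):
    """Marginal distribution of the coordinates listed in coords."""
    out = {}
    for x, w in joint.items():
        key = tuple(x[i] for i in coords)
        out[key] = out.get(key, 0.0) + w
    return out

def kl_to_uniform(dist, num_bits):
    """D(dist || uniform on {0,1}^num_bits), computed as log(2^num_bits) - H[dist]."""
    entropy = -sum(w * math.log(w) for w in dist.values() if w > 0)
    return num_bits * math.log(2) - entropy

random.seed(2)
m, k = 4, 2
# Indices 0 and 3 are covered once, indices 1 and 2 twice: at most k = 2 each.
subsets = [(0, 1), (1, 2), (2, 3)]

smallest_gap = math.inf
for _ in range(200):
    raw = {x: random.random() for x in product([0, 1], repeat=m)}
    total = sum(raw.values())
    joint = {x: w / total for x, w in raw.items()}   # law of (X_1, ..., X_m)
    lhs = k * kl_to_uniform(joint, m)
    rhs = sum(kl_to_uniform(marginal(joint, s), len(s)) for s in subsets)
    smallest_gap = min(smallest_gap, lhs - rhs)

# Corollary 2.8 predicts lhs - rhs >= 0 for every joint distribution.
print(smallest_gap)
```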



3 Read-k families of functions



We prove Theorem 1.1 in this section. We first fix notations. Let $X_1, \dots, X_m$ be independent random variables. For $j \in [r]$, let $P_j \subseteq [m]$ be a subset and let $Y_j = f_j((X_i)_{i \in P_j})$. We assume that $Y_1, \dots, Y_r$ are a read-$k$ family, that is, $|\{j : i \in P_j\}| \le k$ for every $i \in [m]$. Let $\Pr[Y_i = 1] = p_i$, and let $p$ be the average of $p_1, \dots, p_r$. Let $\varepsilon > 0$ be fixed. We will show that

$$\Pr[Y_1 + \dots + Y_r \ge (p+\varepsilon) r] \le e^{-D(p+\varepsilon\|p)\cdot r/k} \tag{1}$$

and

$$\Pr[Y_1 + \dots + Y_r \le (p-\varepsilon) r] \le e^{-D(p-\varepsilon\|p)\cdot r/k}. \tag{2}$$



We first note that it suffices to prove the theorem in the case where each $X_i$ is uniform over a finite set $A_i$. This is because we can assume w.l.o.g. that each $X_i$ is a discrete random variable, and any discrete distribution can be approximated to arbitrary accuracy by a function of a uniform distribution over a large enough finite set. We denote by $A = A_1 \times \dots \times A_m$ the set of all possible inputs, and let $\mu$ denote the uniform distribution over $A$.

We start by proving (1). For $t = (p+\varepsilon) r$ let us denote by $A^{\ge t}$ the set of inputs for which $\sum_j f_j \ge t$,

$$A^{\ge t} = \Big\{ (a_1, \dots, a_m) \in A_1 \times \dots \times A_m : \sum_{j=1}^{r} f_j(a_1, \dots, a_m) \ge t \Big\}.$$

We denote by $\mu^{\ge t}$ the uniform distribution over $A^{\ge t}$. We next define restrictions of these distributions to the sets $P_1, \dots, P_r$. For a set $P_j \subseteq [m]$ we denote by $\mu_j$ the restriction of $\mu$ to the coordinates of $P_j$ (it is the uniform distribution over $\prod_{i \in P_j} A_i$), and by $\mu_j^{\ge t}$ the restriction of $\mu^{\ge t}$ to the coordinates of $P_j$.

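We wish to upper bound the probability that $\sum_j f_j \ge t$. Equivalently, we wish to upper bound $|A^{\ge t}| / |A|$. Taking logarithms, this is equal to $H[\mu^{\ge t}] - H[\mu]$. By Fact 2.2 this is equal to $-D(\mu^{\ge t}\|\mu)$. That is,

$$\Pr\Big[\sum_{j=1}^{r} f_j \ge t\Big] = \frac{|A^{\ge t}|}{|A|} = \exp\big(H[\mu^{\ge t}] - H[\mu]\big) = \exp\big(-D(\mu^{\ge t}\|\mu)\big). \tag{3}$$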



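So, we need to obtain a lower bound on $D(\mu^{\ge t}\|\mu)$. Naturally, to do that we will use Corollary 2.8. We thus have

$$\Pr\Big[\sum_{j=1}^{r} f_j \ge t\Big] \le \exp\Big(-\frac{1}{k} \sum_{j=1}^{r} D(\mu_j^{\ge t}\|\mu_j)\Big). \tag{4}$$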



Recall that $p_j = \Pr_{\mu}[f_j = 1]$ is the probability that $f_j = 1$ under the uniform distribution. Let us denote by $q_j = \Pr_{\mu^{\ge t}}[f_j = 1]$ the probability that $f_j = 1$ conditioned on $\sum_j f_j \ge t$. By Fact 2.1 (using $\phi = f_j$) we have that $D(\mu_j^{\ge t}\|\mu_j) \ge D(q_j\|p_j)$, hence

$$\Pr\Big[\sum_{j=1}^{r} f_j \ge t\Big] \le \exp\Big(-\frac{1}{k} \sum_{j=1}^{r} D(q_j\|p_j)\Big). \tag{5}$$



By convexity of Kullback-Leibler divergence (Fact 2.4) we have that

$$\frac{1}{r} \sum_{j=1}^{r} D(q_j\|p_j) \ge D(q\|p),$$

where $q = \frac{1}{r}\sum_{j=1}^{r} q_j$ and $p = \frac{1}{r}\sum_{j=1}^{r} p_j$. To conclude, recall that $q_j$ is the probability that $f_j = 1$ given that $f_1 + \dots + f_r \ge t$. Hence the sum $\sum_{j=1}^{r} q_j$ is the expected number of $f_j$ for which $f_j = 1$, which by definition is at least $t$. Hence $q \ge t/r = p + \varepsilon$ and $D(q\|p) \ge D(p+\varepsilon\|p)$ by Fact 2.5. We have thus shown that

$$\Pr[f_1 + \dots + f_r \ge (p+\varepsilon) r] \le \exp(-D(p+\varepsilon\|p) \cdot r/k),$$

which establishes (1).

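The proof of (2) is very similar. Define for $t = (p-\varepsilon) r$ the set $A^{\le t}$ as

$$A^{\le t} = \Big\{ (a_1, \dots, a_m) \in A_1 \times \dots \times A_m : \sum_{j=1}^{r} f_j(a_1, \dots, a_m) \le t \Big\}.$$

Let $\mu^{\le t}$ be the uniform distribution over $A^{\le t}$, and let $\mu_j^{\le t}$ be its projection to $(X_i)_{i \in P_j}$. Then analogously to (4) we have

$$\Pr\Big[\sum_{j=1}^{r} f_j \le t\Big] \le \exp\Big(-\frac{1}{k} \sum_{j=1}^{r} D(\mu_j^{\le t}\|\mu_j)\Big). \tag{6}$$

Let $q_j' = \Pr_{\mu^{\le t}}[f_j = 1]$ be the probability that $f_j = 1$ conditioned on $\sum_j f_j \le t$. Then analogously to (5) we have

$$\Pr\Big[\sum_{j=1}^{r} f_j \le t\Big] \le \exp\Big(-\frac{1}{k} \sum_{j=1}^{r} D(q_j'\|p_j)\Big). \tag{7}$$

Let $q' = \frac{1}{r}\sum_{j=1}^{r} q_j'$. Then by convexity of Kullback-Leibler divergence, we have

$$\frac{1}{r} \sum_{j=1}^{r} D(q_j'\|p_j) \ge D(q'\|p).$$

Now, $q' \le t/r = p - \varepsilon$ by definition, hence by Fact 2.6 $D(q'\|p) \ge D(p-\varepsilon\|p)$. So, we conclude that

$$\Pr[f_1 + \dots + f_r \le (p-\varepsilon) r] \le \exp(-D(p-\varepsilon\|p) \cdot r/k),$$

which establishes (2).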
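Both tail bounds of Theorem 1.1 can be checked exhaustively on a small read-2 family. In the sketch below, $X_1, \dots, X_{10}$ are independent fair bits arranged in a cycle and $Y_j$ is the AND of two neighbouring bits, so $r = 10$, $k = 2$ and $p = 1/4$. This family, the parameter values, and the helper names are assumptions made for the illustration.

```python
import math
from itertools import product

def kl(q, p):
    """Binary Kullback-Leibler divergence D(q||p) in nats (0*log 0 treated as 0)."""
    total = 0.0
    if q > 0:
        total += q * math.log(q / p)
    if q < 1:
        total += (1 - q) * math.log((1 - q) / (1 - p))
    return total

# Hypothetical read-2 family: Y_j = X_j AND X_{j+1 mod m} on a cycle of fair bits.
m, k = 10, 2
reads = [(j, (j + 1) % m) for j in range(m)]
r = len(reads)
p = 0.25                      # Pr[Y_j = 1] for every j

# counts[s] = number of inputs on which exactly s of the Y_j equal 1.
counts = [0] * (r + 1)
for x in product([0, 1], repeat=m):
    counts[sum(x[a] & x[b] for a, b in reads)] += 1
total = 2 ** m

def upper_tail(t):
    """Exact Pr[Y_1 + ... + Y_r >= t]."""
    return sum(counts[t:]) / total

def lower_tail(t):
    """Exact Pr[Y_1 + ... + Y_r <= t]."""
    return sum(counts[: t + 1]) / total

def bound(t):
    """exp(-D(t/r || p) * r / k), the bound of Theorem 1.1 at threshold t."""
    return math.exp(-kl(t / r, p) * r / k)

# Collect every integer threshold at which a bound would fail (none expected).
violations = [t for t in range(r + 1) if t > p * r and upper_tail(t) > bound(t) + 1e-12]
violations += [t for t in range(r + 1) if t < p * r and lower_tail(t) > bound(t) + 1e-12]
print(violations)
```

The list of violations should be empty, in agreement with (1) and (2).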